{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "W10_Homework.ipynb",
      "provenance": [],
      "collapsed_sections": [],
      "toc_visible": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "i79mWlGhdH2T"
      },
      "source": [
        "#CIS Week 10 Homework\n",
        "\n",
        "**Instructor:** Lyle Ungar\n",
        "\n",
        "**Content Creators:** Sanjeevini Ganni, Brittany Shields, Alessandra Rossi"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "27G1cTCwbFj4",
        "cellView": "form"
      },
      "source": [
        "#@markdown What is your Pennkey and pod? (text, not numbers, e.g. bfranklin)\n",
        "my_pennkey = 'bfranklin' #@param {type:\"string\"}\n",
        "my_pod = 'euclidean-wombat' #@param ['Select', 'euclidean-wombat', 'sublime-newt', 'buoyant-unicorn', 'lackadaisical-manatee','indelible-stingray','superfluous-lyrebird','discreet-reindeer','quizzical-goldfish','ubiquitous-cheetah','nonchalant-crocodile','fashionable-lemur','spiffy-eagle','electric-emu','quotidian-lion','astute-jellyfish', 'quantum-herring']\n",
        "\n",
        "# start timing\n",
        "import time\n",
        "try:t0;\n",
        "except NameError: t0 = time.time()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_dF811j6dfb8"
      },
      "source": [
        "We **strongly** recommend that you keep a separate document offline with your answers, and paste them in when you're ready to submit. Colab may reset and clear your notebook after a period of inactivity.\n",
        "\n",
        "Go to Runtime -> Change runtime type -> Set Hardware Accelerator to \"GPU\""
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "wuuPO_pvdhcB"
      },
      "source": [
        "###**Learning Objectives**\n",
        "Finetune a Transformer model\n",
        "\n",
        "How might word embedding perpetuate systemic bias?\n",
        "\n",
        "How might we work with policymakers and companies to reduce the spread of fake news?"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "GQDE84gxsfgl"
      },
      "source": [
        "## Part I: Finetune GPT-2\n",
        "\n",
        "\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9r0hL5EKQEAd"
      },
      "source": [
        "GPT2 is a large-scale transformer-based language model developed by OpenAI. It is pre-trained on a large corpus of dataset of 8 million web pages has around 1.5 billion parameters. GPT2 is useful for language generation tasks, that is it predicts the next word, given some text.  \n",
        "\n",
        "In this excersice we will be finetuning DistillGPT-2, distilled version of GPT2. We want the language model to generate wikipedia style of text. Hence, we will finetune distillGPT on [wikitext-2-raw-v1](https://huggingface.co/datasets/wikitext) dataset scrapped from wikipedia articles."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "cellView": "form",
        "id": "on9-5HkDS8O0"
      },
      "source": [
        "#@title Install\n",
        "!pip install transformers\n",
        "!pip install datasets"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "xRPP3yuiTEiR"
      },
      "source": [
        "#@title Imports\n",
        "from datasets import load_dataset\n",
        "from transformers import AutoTokenizer\n",
        "from transformers import AutoModelForCausalLM\n",
        "from transformers import Trainer, TrainingArguments"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "cellView": "form",
        "id": "L-oWtmsYTH2b"
      },
      "source": [
        "#@title Load Datset\n",
        "datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "OyRTTP1HTNyW"
      },
      "source": [
        "Example of Dataset:"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "bLd2wFS6TMkn"
      },
      "source": [
        "datasets[\"train\"][48]"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Nkpb_pmfUxnp"
      },
      "source": [
        "####Tokenizer"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "x4LTI4c3TVRJ"
      },
      "source": [
        "#Define the model Checkpoint\n",
        "model_checkpoint = \"distilgpt2\"\n",
        "\n",
        "#Tokenizer  \n",
        "#To tokenize all our texts with the same vocabulary that was used when training the model, we have to download a pretrained tokenizer. \n",
        "#This is all done by the AutoTokenizer class:  \n",
        "tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)\n",
        "\n",
        "#We define a function to call the tokenizer on our texts:\n",
        "def tokenize_function(examples):\n",
        "    return tokenizer(examples[\"text\"])\n",
        "\n",
        "#Apply tokenizer to all the splits in our datasets object and drop the text coolumn\n",
        "tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=[\"text\"])"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "X-S7KN92UoqV"
      },
      "source": [
        "#we need to concatenate all our texts together then split the result in small chunks of a certain block_size\n",
        "#Define the block size for training\n",
        "block_size = 128\n",
        "\n",
        "#preprocessing function that will group our texts\n",
        "def group_texts(examples):\n",
        "    # Concatenate all texts.\n",
        "    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}\n",
        "    total_length = len(concatenated_examples[list(examples.keys())[0]])\n",
        "    # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can\n",
        "        # customize this part to your needs.\n",
        "    total_length = (total_length // block_size) * block_size\n",
        "    # Split by chunks of max_len.\n",
        "    result = {\n",
        "        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]\n",
        "        for k, t in concatenated_examples.items()\n",
        "    }\n",
        "    result[\"labels\"] = result[\"input_ids\"].copy()\n",
        "    return result\n",
        "\n",
        "lm_datasets = tokenized_datasets.map(\n",
        "    group_texts,\n",
        "    batched=True,\n",
        "    batch_size=1000,\n",
        "    num_proc=4,\n",
        ")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9tdKwkgSV0n4"
      },
      "source": [
        "###Load pre-trained model"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "1nCy6io8Vl3g"
      },
      "source": [
        "#####Pretrained model\n",
        "model = AutoModelForCausalLM.from_pretrained(model_checkpoint)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "AT23dcpEYQ6Z"
      },
      "source": [
        "###Run Generations before Finetuning"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "HHQ3tohnYT91"
      },
      "source": [
        "def run_generations(PROMPT_TEXT):\n",
        "  device = model.device\n",
        "  input_ids = tokenizer.encode(PROMPT_TEXT, return_tensors=\"pt\").to(device)\n",
        "\n",
        "  sample_output = model.generate(\n",
        "      input_ids, \n",
        "      do_sample=True, \n",
        "      max_length=100, \n",
        "      top_p=0.95,\n",
        "      num_return_sequences=3, \n",
        "      early_stopping=True\n",
        "  )\n",
        "\n",
        "\n",
        "  for i, output in enumerate(sample_output):\n",
        "    print(\"Generation: \"+ str(i) +\"\\n\" + 100 * '-')\n",
        "    print(tokenizer.decode(output, skip_special_tokens=True))\n",
        "    print()\n",
        "\n",
        "\n",
        "########\n",
        "PROMPT_TEXT = 'Fast and the Furious '\n",
        "run_generations(PROMPT_TEXT)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "4FmSG76QZ1jW"
      },
      "source": [
        "############## Run Generation for different prompt texts of your choice\n",
        "PROMPT_TEXT = ...\n",
        "run_generations(PROMPT_TEXT)\n",
        "\n",
        "\n",
        "PROMPT_TEXT = ...\n",
        "run_generations(PROMPT_TEXT)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1C6pem7UV6Y1"
      },
      "source": [
        "###Fine-Tuning (This will take around 20-25 minutes)"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "4EhRUYDbV_WD"
      },
      "source": [
        "training_args = TrainingArguments(\n",
        "    \"test-clm\",\n",
        "    evaluation_strategy = \"epoch\",\n",
        "    learning_rate=2e-5,\n",
        "    weight_decay=0.01,\n",
        "    num_train_epochs=2,\n",
        ")\n",
        "trainer = Trainer(\n",
        "    model=model,\n",
        "    args=training_args,\n",
        "    train_dataset=lm_datasets[\"train\"],\n",
        "    eval_dataset=lm_datasets[\"validation\"],\n",
        ")\n",
        "trainer.train()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "f6qyUdahW8yT"
      },
      "source": [
        "Perplexity of Validation Set"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Aes_z3-_W_SW"
      },
      "source": [
        "import math\n",
        "eval_results = trainer.evaluate()\n",
        "print(f\"Perplexity: {math.exp(eval_results['eval_loss']):.2f}\")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "q2Bvbn-kXGDV"
      },
      "source": [
        "###Run Generations after Finetuning\n",
        "Run generations for different prompts and compare them with the initial generations."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "_lkGBXXXXMWm"
      },
      "source": [
        "PROMPT_TEXT = 'Fast and the Furious '\n",
        "run_generations(PROMPT_TEXT)\n",
        "\n",
        "############## Run Generation for different prompt texts of your choice\n",
        "PROMPT_TEXT = ...\n",
        "run_generations(PROMPT_TEXT)\n",
        "\n",
        "\n",
        "PROMPT_TEXT = ...\n",
        "run_generations(PROMPT_TEXT)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "cellView": "form",
        "id": "9vdFM5K_btoH"
      },
      "source": [
        "#@markdown Question 0: How do the generations after finetuning compare to the generations before finetuning. Do you think finetuning acheived its purpose of generating Wikipedia like texts? \n",
        "q0 = \"\" #@param{type:\"string\"}\n",
        "\n",
        "try:t0;\n",
        "except NameError: t0 = time.time()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "CZ7iz5bsd0uF"
      },
      "source": [
        "## Part II: Primer on Gender Bias in Word Embeddings"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "FQzJNYPseA_c"
      },
      "source": [
        "Read the following primer that describes the societal implications of gender bias in word embeddings: \\\\\n",
        "“Man is to Doctor as Woman is to Nurse: The Gender Bias of Word Embeddings: Why we should worry about gender inequality in Natural Language Processing techniques” \\\\\n",
        "https://towardsdatascience.com/gender-bias-word-embeddings-76d9806a0e17 \\\\\n",
        "*9 min read*, Tommaso Buonocore, 2019\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "m2yJzN7CeS3m",
        "cellView": "form"
      },
      "source": [
        "#@markdown Question 1: Identify a compelling quote or example from the article that resonated with you.  How does it emphasize the societal responsibility of data scientists?\n",
        "q1 = \"\" #@param{type:\"string\"}\n",
        "\n",
        "try:t1;\n",
        "except NameError: t1 = time.time()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "q3j4iFbnepdX"
      },
      "source": [
        "##Part III: Tech Paper on Gender Bias in Word Embeddings\n",
        "\n",
        "In the previous article, the author referenced that data scientists are introducing new strategies to reduce biases in word embeddings.  In 2016, researchers from Boston University and Microsoft Research published a paper with significant findings demonstrating the prevalence of gender bias in word embeddings, as well as proposed mitigation measures.  A 5-minute summary of the technical components of the paper can be found [here](https://towardsdatascience.com/tackling-gender-bias-in-word-embeddings-c965f4076a10) and the original paper can be found [here](https://arxiv.org/abs/1607.06520). \\\n",
        "\n",
        "“Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings” \\\n",
        "https://arxiv.org/abs/1607.06520 \\\n",
        "*30 min read*, 2016\n",
        "\n",
        "\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "VZXjlLvAdjk6",
        "cellView": "form"
      },
      "source": [
        "#@markdown Question 2: In the “Discussion” section, the authors consider the responsibility of machine language programmers with regards to bias in word embeddings.  Describe their stance.\n",
        "q2 = \"\" #@param{type:\"string\"}\n",
        "\n",
        "#@markdown Question 3: How could word embeddings amplify societal stereotypes?\n",
        "q3 = \"\" #@param{type:\"string\"}\n",
        "\n",
        "#@markdown Question 4: How could the authors’ proposed technological interventions address this concern?\n",
        "q4 = \"\" #@param{type:\"string\"}\n",
        "\n",
        "try:t2;\n",
        "except NameError: t2 = time.time()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Ex7vA7rfgpFq"
      },
      "source": [
        "##Part IV: Fake News"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "G9TVVjSogsvL"
      },
      "source": [
        "The following article provides an overview of neural fake news, including potential defenses against it as well as suggestions for further research.\n",
        "\n",
        "“An Exhaustive Guide to Detecting and Fighting Neural Fake News using NLP” \\\n",
        "•\thttps://www.analyticsvidhya.com/blog/2019/12/detect-fight-neural-fake-news-nlp/ \\\n",
        "•\t*16-minute read*, Analytics Vidhya, 2019\n",
        "\n",
        "In 2019, Jack Clark, the Policy Director at OpenAI, gave testimony to the US House Intelligence Committee.  You can either read his written 10-page testimony here or watch his live testimony here (jump to minute 11:00 to watch his 5 minute testimony).\n",
        "\n",
        "Written Testimony of Jack Clark, Policy Director, OpenAI \\\n",
        "•\thttps://intelligence.house.gov/uploadedfiles/clark_deepfakes_sfr.pdf \\\n",
        "•\tor \\\n",
        "•\thttps://www.c-span.org/video/?461679-1/house-intelligence-committee-hearing-deepfake-videos \\\n",
        "*Jump to minute 11:00* \\\n",
        "•\tHouse Permanent Select Committee on Intelligence, 2019 \\\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Cmh3fCcChEyw",
        "cellView": "form"
      },
      "source": [
        "#@markdown Question 5: Jack Clark enumerated several ideas for interventions.  Discuss one of the specific ideas he suggested.\n",
        "q5 = \"\" #@param{type:\"string\"}\n",
        "\n",
        "#@markdown Question 6: How might technologists work with policymakers and companies to prevent the spread of fake news?\n",
        "q6 = \"\" #@param{type:\"string\"}\n",
        "\n",
        "#@markdown Question 7: •\tOpenAI is not making GPT-3 publicly available, claiming that they are concerned that it might be used for doing bad. Do you think that was a good decision? Why?\n",
        "q7 = \"\" #@param{type:\"string\"}\n",
        "\n",
        "try:t3;\n",
        "except NameError: t3 = time.time()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "k5md7iimhYEo"
      },
      "source": [
        "## PART V: Know your Pod"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Yl_r2SsehjCI"
      },
      "source": [
        "With two other members of your pod, do the following. Have each of them recommend you their favorite song. Then, listen to those songs and write down your thoughts for each. (~100 words each)"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Je_h2CB-hdX8"
      },
      "source": [
        "know_a_pod_1 = \"\" #@param{type:\"string\"}\n",
        "know_a_pod_2 = \"\" #@param{type:\"string\"}\n",
        "\n",
        "try:t4;\n",
        "except NameError: t4 = time.time()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "BS2TalTq2UE5"
      },
      "source": [
        "#Submission\n",
        "\n",
        "Once you're done, click on 'Share' and add the link to the box below."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "cellView": "form",
        "id": "dx3J50OUiz5D"
      },
      "source": [
        "link = '' #@param {type:\"string\"}"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "1Gf9eCXe2Vms"
      },
      "source": [
        "import time\n",
        "import numpy as np\n",
        "import urllib.parse\n",
        "from IPython import display\n",
        "from IPython.display import IFrame\n",
        "\n",
        "\n",
        "t5 = time.time()\n",
        "\n",
        "#@markdown #Run Cell to Show Airtable Form\n",
        "#@markdown ##**Confirm your answers and then click \"Submit\"**\n",
        "\n",
        "\n",
        "def prefill_form(src, fields: dict):\n",
        "  '''\n",
        "  src: the original src url to embed the form\n",
        "  fields: a dictionary of field:value pairs,\n",
        "  e.g. {\"pennkey\": my_pennkey, \"location\": my_location}\n",
        "  '''\n",
        "  prefill_fields = {}\n",
        "  for key in fields:\n",
        "      new_key = 'prefill_' + key\n",
        "      prefill_fields[new_key] = fields[key]\n",
        "  prefills = urllib.parse.urlencode(prefill_fields)\n",
        "  src = src + prefills\n",
        "  return src\n",
        "\n",
        "\n",
        "\n",
        "#autofill fields if they are not present\n",
        "#a missing pennkey and pod will result in an Airtable warning\n",
        "#which is easily fixed user-side.\n",
        "try: my_pennkey;\n",
        "except NameError: my_pennkey = \"\"\n",
        "\n",
        "try: my_pod;\n",
        "except NameError: my_pod = \"Select\"\n",
        "\n",
        "try: q0;\n",
        "except NameError: q0 = \"\"\n",
        "\n",
        "try: q1;\n",
        "except NameError: q1 = \"\"\n",
        "\n",
        "try: q2;\n",
        "except NameError: q2 = \"\"\n",
        "\n",
        "try: q3;\n",
        "except NameError: q3 = \"\"\n",
        "\n",
        "try: q4;\n",
        "except NameError: q4 = \"\"\n",
        "\n",
        "try: q5;\n",
        "except NameError: q5 = \"\"\n",
        "\n",
        "try: q6;\n",
        "except NameError: q6 = \"\"\n",
        "\n",
        "try: q7;\n",
        "except NameError: q7 = \"\"\n",
        "\n",
        "try: know_a_pod_1;\n",
        "except NameError: know_a_pod_1 = \"\"\n",
        "\n",
        "try: know_a_pod_2;\n",
        "except NameError: know_a_pod_2 = \"\"\n",
        "\n",
        "try: link;\n",
        "except NameError: link = \"\"\n",
        "\n",
        "fields = {\"pennkey\": my_pennkey,\n",
        "          \"pod\": my_pod,\n",
        "          \"q0\": q0,\n",
        "          \"q1\": q1,\n",
        "          \"q2\": q2,\n",
        "          \"q3\": q3,\n",
        "          \"q4\": q4,\n",
        "          \"q5\": q5,\n",
        "          \"q6\": q6,\n",
        "          \"q7\": q7,\n",
        "          \"know_a_pod_1\": know_a_pod_1,\n",
        "          \"know_a_pod_2\": know_a_pod_2,\n",
        "          \"link\": link}\n",
        "\n",
        "src = \"https://airtable.com/embed/shrgundiXUEJFVzwd?\"\n",
        "\n",
        "\n",
        "#now instead of the original source url, we do: src = prefill_form(src, fields)\n",
        "display.display(IFrame(src = prefill_form(src, fields), width = 800, height = 400))"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "LcCRZDosYS0H"
      },
      "source": [
        ""
      ],
      "execution_count": null,
      "outputs": []
    }
  ]
}